Goto

Collaborating Authors

 separation quality


Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

arXiv.org Artificial Intelligence

Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at https://clearsep.github.io.


Improving Real-Time Music Accompaniment Separation with MMDenseNet

arXiv.org Artificial Intelligence

Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using a larger model structure, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenstNet, to strike a balance between separation quality and latency for real-time applications. Different directions of improvement are explored or proposed in this paper, including complex ideal ratio mask, self-attention, band-merge-split method, and feature look back. Source-to-distortion ratio, real-time factor, and optimal latency are employed to evaluate the performance. To align with our application requirements, the evaluation process in this paper focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvement achieves low real-time factor and optimal latency while maintaining acceptable separation quality.


Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model

arXiv.org Artificial Intelligence

We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments. To this end, we adopt the Asynchronous Fully Recurrent Convolutional Neural Network (A-FRCNN), which has shown successful results in audio-only speech separation. Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality. We evaluated our model in a controlled environment using the NTCD-TIMIT dataset and in-the-wild using a synthetic dataset that combines LRS3 and WHAM!. The experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines. Furthermore, the reduced footprint of our model makes it suitable for low resource applications.


Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

arXiv.org Artificial Intelligence

Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. Most of these approaches need an additional stitching step to stitch the separated speech chunks for long form audio. Since most of the approaches involve Permutation Invariant training (PIT), the order of separated speech chunks is nondeterministic and leads to difficulty in accurately stitching homogenous speaker chunks for downstream tasks like Automatic Speech Recognition (ASR). Also, most of these models are trained with synthetic mixtures and do not generalize to real conversational data. In this paper, we propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal using an over-clustering based approach. This model naturally regulates the order of the separated chunks without the need for an additional stitching step. We also introduce a data sampling strategy with real and synthetic mixtures which generalizes well to real conversation speech. With this model and data sampling technique, we show significant improvements in speaker-attributed word error rate (SA-WER) on Hub5 data.


Active Audio-Visual Separation of Dynamic Sound Sources

arXiv.org Artificial Intelligence

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces [14] simulations in real-world scanned Matterport3D [12] environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target.


Model selection for deep audio source separation via clustering analysis

arXiv.org Machine Learning

Audio source separation is the process of separating a mixture (e.g. a pop band recording) into isolated sounds from individual sources (e.g. just the lead vocals). Deep learning models are the state-of-the-art in source separation, given that the mixture to be separated is similar to the mixtures the deep model was trained on. This requires the end user to know enough about each model's training to select the correct model for a given audio mixture. In this work, we automate selection of the appropriate model for an audio mixture. We present a confidence measure that does not require ground truth to estimate separation quality, given a deep model and audio mixture. We use this confidence measure to automatically select the model output with the best predicted separation quality. We compare our confidence-based ensemble approach to using individual models with no selection, to an oracle that always selects the best model and to a random model selector. Results show our confidence-based ensemble significantly outperforms the random ensemble over general mixtures and approaches oracle performance for music mixtures.


Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy

arXiv.org Artificial Intelligence

Separating a singing voice from its music accompaniment remains an important challenge in the field of music information retrieval. We present a unique neural network approach inspired by a technique that has revolutionized the field of vision: pixel-wise image classification, which we combine with cross entropy loss and pretraining of the CNN as an autoencoder on singing voice spectrograms. The pixel-wise classification technique directly estimates the sound source label for each time-frequency (T-F) bin in our spectrogram image, thus eliminating common pre-and postprocessing tasks. The proposed network is trained by using the Ideal Binary Mask (IBM) as the target output label. The IBM identifies the dominant sound source in each T-F bin of the magnitude spectrogram of a mixture signal, by considering each T-F bin as a pixel with a multi-label (for each sound source). Cross entropy is used as the training objective, so as to minimize the average probability error between the target and predicted label for each pixel. By treating the singing voice separation problem as a pixel-wise classification task, we additionally eliminate one of the commonly used, yet not easy to comprehend, postprocessing steps: the Wiener filter postprocessing. The proposed CNN outperforms the first runner up in the Music Information Retrieval Evaluation eXchange (MIREX) 2016 and the winner of MIREX 2014 with a gain of 2.2702 5.9563 dB global normalized source to distortion ratio (GNSDR) when applied to the iKala dataset. This work is supported by the MOE Academic fund AFD 05/15 SL and SUTD SRG ISTD 2017 129. Corresponding Author D. Herremans Singapore University of Technology and Design, Singapore & Institute for High Performance Computing, A*STAR, Singapore Email: dorien herremans@sutd.edu.sg 1 INTRODUCTION to compete with cutting-edge singing voice separation systems which use multichannel modeling,data augmentation, and model blending. Keywords Singing Voice Separation · Convolutional Neural Network · Ideal Binary Mask · Cross Entropy · Pixel-wise Image Classification 1 Introduction Humans have an exceptional ability to separate different sounds from a musical signal [3]. For instance, some musicians can distinguish the guitar part from a song and transcribe it; and most non-musician listeners are able to hear and sing along to lyrics of a song.


Using recurrences in time and frequency within U-net architecture for speech enhancement

arXiv.org Machine Learning

ABSTRACT When designing fully-convolutional neural network, there is a tradeoff between receptive field size, number of parameters and spatial resolution of features in deeper layers of the network. Inthis work we present a novel network design based on combination of many convolutional and recurrent layers that solves these dilemmas. We compare our solution with U-nets based models known from the literature and other baseline modelson speech enhancement task. We test our solution onTIMIT speech utterances combined with noise segments extractedfrom NOISEX-92 database and show clear advantage of proposed solution in terms of SDR (signal-todistortion ratio),SIR (signal-to-interference ratio) and STOI (spectro-temporal objective intelligibility) metrics compared to the current state-of-the-art. Index Terms-- deep learning, speech enhancement, U-nets 1.INTRODUCTION The single-channel speech enhancement problem is to reduce a noise present in a single-channel recording of speech.


Bayesian Unification of Sound Source Localization and Separation with Permutation Resolution

AAAI Conferences

Sound source localization and separation with permutation resolution are essential for achieving a computational auditory scene analysis system that can extract useful information from a mixture of various sounds. Because existing methods cope separately with these problems despite their mutual dependence, the overall result with these approaches can be degraded by any failure in one of these components. This paper presents a unified Bayesian framework to solve these problems simultaneously where localization and separation are regarded as a clustering problem. Experimental results confirm that our method outperforms state-of-the-art methods in terms of the separation quality with various setups including practical reverberant environments.